
Conversation

@Shnatsel
Collaborator

No description provided.

@codecov-commenter

codecov-commenter commented Dec 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.85%. Comparing base (7b15bc2) to head (31954e7).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
+ Coverage   99.82%   99.85%   +0.03%     
==========================================
  Files          13       13              
  Lines        2258     2764     +506     
==========================================
+ Hits         2254     2760     +506     
  Misses          4        4              


@Shnatsel Shnatsel mentioned this pull request Jan 21, 2026
@Shnatsel
Collaborator Author

On Zen 4 this gives up to a 7% penalty due to not utilizing AVX-512, but otherwise looks normal. It seems we don't need an explicit mul_neg_add on x86; it is lowered into the correct instruction automatically.

On Apple M4 this is a large regression. The hottest instructions are loads/stores to/from the stack for f32x16, so it might be due to register pressure or some such (LLVM isn't great at dealing with that). I'll need to investigate how wide lowers this kind of thing to NEON; its approach is apparently better than fearless_simd's. Or we could rewrite the function to operate on native-width vectors, but then we might give up some ILP.
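
To illustrate what I mean by "lowered automatically", here's a minimal scalar sketch (the function name is made up; it's not an API from this PR or from fearless_simd):

// Illustrative only: computes c - a*b via a fused multiply-add with one
// operand negated. With FMA available (e.g. -C target-cpu=x86-64-v3),
// LLVM folds the negation into a single fused instruction, so no explicit
// mul_neg_add helper is needed.
pub fn neg_mul_add(a: f32, b: f32, c: f32) -> f32 {
    (-a).mul_add(b, c)
}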

@valadaptive

On Apple M4 this is a large regression. The hottest instructions are loads/stores to/from the stack for f32x16, so it might be due to register pressure or some such

This is a wild guess (I don't have Apple Silicon hardware, so I can't benchmark any of this), but the way you're loading from a slice looks a bit convoluted. Instead of e.g.

let in0_re = f32x4::simd_from(simd, <[f32; 4]>::try_from(&reals_s0[0..4]).unwrap());

have you tried simply:

let in0_re = f32x4::from_slice(simd, &reals_s0[0..4]);

Also just to confirm, you ran this with the latest fearless_simd from Git, correct? linebender/fearless_simd#159 aimed to improve codegen around SIMD loads, and linebender/fearless_simd#181 just landed a couple days ago and adds (potentially) faster methods for SIMD stores.

@Shnatsel
Collaborator Author

Shnatsel commented Jan 21, 2026

Yep, this is on the latest fearless_simd from git. I'll see if from_slice does anything; it's certainly more readable.

I've also tried swapping the vector repr from arrays to structs to mimic wide's internal representation, but it didn't make a difference.
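
Roughly what I mean by the two representations, with made-up type names (this is only my reading of wide's layout, not its actual source):

// Illustrative types only, not wide's or fearless_simd's real definitions.
#[derive(Clone, Copy)]
struct F32x4Array([f32; 4]);

#[derive(Clone, Copy)]
struct F32x4Struct {
    a: f32,
    b: f32,
    c: f32,
    d: f32,
}
// Both are 16 bytes with 4-byte alignment, so it's plausible that this
// swap alone changes nothing in codegen.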

@Shnatsel
Collaborator Author

CI is broken in a really interesting way: it complains about mul_neg_add, which doesn't appear anywhere in the code on the latest commit. It's either running on an old commit or on a different branch; either way, that could be exploitable if it can be reproduced.

@Shnatsel
Collaborator Author

Nope, no difference in performance from changing loads/stores. Looks like a readability win to me though.

@Shnatsel Shnatsel mentioned this pull request Jan 22, 2026
@Shnatsel
Collaborator Author

Shnatsel commented Jan 22, 2026

I had Claude analyze the generated assembly from fft_dit_chunk_32_simd_f32, and it seems that the regression is due to over-eager inlining: https://claude.ai/share/920afa1e-8e39-48a1-b00f-1a55d6c25d1c

Load/store instructions are hot in the profiler (samply), so this makes sense as a regression due to stack spills. This also explains why all the kernels are gone from the cargo asm view: they've all been aggressively inlined.

This is likely an artifact of fearless_simd's #[inline(always)]. We go through vectorize() to add a function call that isn't force-inlined, but apparently that's not enough.
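
The kind of boundary I'm thinking of trying, as a rough sketch (the kernel name and signature are placeholders, not the real code):

// Rough sketch only. The point is the attribute: an #[inline(never)]
// function keeps its own stack frame instead of being folded into the
// caller along with every force-inlined helper.
#[inline(never)]
fn fft_dit_chunk_32_outlined(reals: &mut [f32], imags: &mut [f32]) {
    // ... SIMD butterfly body elided ...
    let _ = (reals, imags);
}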

@valadaptive

This is likely an artifact of fearless_simd's #[inline(always)]. We go through vectorize() to add a function call that isn't force-inlined, but apparently that's not enough.

I wonder why it's making a different inlining decision. Was the function marked #[inline(never)] before?

I think we should probably just make vectorize not be inline over in fearless_simd.

…entirety of DiT kernels ends up rolled up into one function on ARM and collapses under register pressure
@Shnatsel
Collaborator Author

Nope, preventing inlining didn't help performance. It seems functions that operate on 512-bit vectors like f32x16 are still much slower than with wide.

@Shnatsel
Collaborator Author

Now that both aren't inlined, I can directly compare the generated assembly. This is almost the first time I've looked at ARM assembly, but what jumps out at me is that wide doesn't use the fmls instruction: it always does an fmla followed by an fneg, which is two instructions to do the same thing, and somehow it's still much faster.
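
To spell out the data flow (fmls semantics are from the Arm reference; the Rust is just scalar pseudocode for one lane, not what either crate emits):

// fmls vd, vn, vm computes vd = vd - vn*vm as one fused multiply-subtract.
fn fmls_dataflow(acc: f32, a: f32, b: f32) -> f32 {
    acc - a * b
}

// One two-instruction equivalent: negate a multiplicand (fneg), then
// accumulate with fmla (vd = vd + vn*vm).
fn fneg_plus_fmla_dataflow(acc: f32, a: f32, b: f32) -> f32 {
    acc + (-a) * b
}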

Assembly of an affected function fft_dit_chunk_64_simd_f32 lifted from samply assembly view:

wide: wide_asm.txt

fearless_simd: fearless_asm.txt

Profiling data

Profile with wide (on main, exact commit): https://share.firefox.dev/3LNEZto

Profile with fearless_simd (#58, exact commit): https://share.firefox.dev/3LVrB6x

Both recorded with samply record -r 5000 cargo bench --profile=profiling --bench=bench 'Forward f32/PhastFT DIT/64'

@Shnatsel
Collaborator Author

I can't really read AArch64 assembly, but Claude can, and Claude has an idea of what's going on: https://claude.ai/share/4fa3af4f-3c34-4d57-b11d-3611385f4f1c

That explains a lot if it's true.

@valadaptive

Try linebender/fearless_simd#184.

@Shnatsel
Collaborator Author

linebender/fearless_simd#184 is a +20% boost on Apple M4! Thanks a lot!

This still isn't as fast as wide but that's a big improvement!

@Shnatsel
Collaborator Author

New profile for fearless_simd: https://share.firefox.dev/3LOGBDc

New assembly for fft_dit_chunk_64_simd_f32: asm_64_f32_fixed_loads_asm.txt

@Shnatsel
Collaborator Author

Shnatsel commented Jan 22, 2026

It looks like the load-store-load pattern for load_array is present even when loading from actual arrays that are just constants: https://claude.ai/share/4fa3af4f-3c34-4d57-b11d-3611385f4f1c

@valadaptive you'll probably want to drop load_array entirely and always use the load_array_ref implementation on AArch64, even for on-stack arrays, because rustc doesn't align stack arrays with this load in mind.
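
To illustrate the alignment point (this is not fearless_simd's code, just the constraint as I understand it):

// Sketch of an element-aligned NEON load: a stack [f32; 4] is only
// guaranteed 4-byte alignment, which vld1q_f32 is fine with, so no
// aligned copy (store/reload round trip) is needed.
#[cfg(target_arch = "aarch64")]
fn load_f32x4_from_array(a: &[f32; 4]) -> core::arch::aarch64::float32x4_t {
    // Safety: `a` is a valid pointer to four f32s with element alignment.
    unsafe { core::arch::aarch64::vld1q_f32(a.as_ptr()) }
}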
